Optimized Llama 3.x perf with sharded residual #15142
Merged
yieldthought requested review from cglagovichTT, mtairum and uaydonat as code owners on November 17, 2024 12:47
yieldthought changed the title from "Llama3/sharded residual" to "Optimized Llama 3.x perf with sharded residual" on Nov 17, 2024
Nice!
cglagovichTT approved these changes on Nov 18, 2024
uaydonat reviewed on Nov 18, 2024
yieldthought force-pushed the llama3/sharded-residual branch from a8d0494 to 9ccefa9 on November 20, 2024 08:52
yieldthought force-pushed the llama3/sharded-residual branch from 64e575c to 4b048cf on November 20, 2024 11:48
mtairum reviewed on Nov 21, 2024
models/demos/llama3/tt/multimodal/llama_cross_attention_transformer_text.py (outdated, resolved)
mtairum approved these changes on Nov 21, 2024
…ring N is an even number of tiles
yieldthought force-pushed the llama3/sharded-residual branch from 12d480a to a3db43c on November 22, 2024 16:27
Note that the async dispatch revert in commit |
spoojaryTT pushed a commit that referenced this pull request on Nov 25, 2024
### Ticket
[14273](#14273)

### Problem description
The shared llama3 codebase used interleaved L1 for the residual during decode. This had lower performance than previous specialized models.

### What's changed
* Sharded the residual path for all combinations of models and devices
* Resolved the L1 corruption issues without deallocate workarounds
* Improved performance over all previous specialised models: 8b n150 is now >24 t/s/u and 70b t3k is >15 t/s/u

### Checklist
- [x] [Post commit CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/12007682870)
- [x] [Single card demo tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975122584)
- [x] [Single card model perf tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975114297)
- [ ] [T3K demo + other tests](https://github.com/tenstorrent/tt-metal/actions/runs/11975098695) ([perf rerun](https://github.com/tenstorrent/tt-metal/actions/runs/12009863881))
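For readers unfamiliar with the two layouts named in the problem description, here is a toy model of why a sharded residual can need fewer cross-core reads than an interleaved one. It is a sketch only, not code from this PR, and it does not call ttnn. The 32-element tile width, the round-robin interleaving rule, the function names, and the assumption that core `c` consumes the `c`-th contiguous slice of the residual are all illustrative assumptions.

```python
TILE_WIDTH = 32  # elements per tile edge; an assumption for this toy model


def interleaved_owner(tile_idx: int, num_cores: int) -> int:
    """Interleaved: tiles are dealt round-robin, so tile i sits on core i % num_cores."""
    return tile_idx % num_cores


def sharded_owner(tile_idx: int, tiles_per_core: int) -> int:
    """Width-sharded: each core owns one contiguous run of tiles_per_core tiles."""
    return tile_idx // tiles_per_core


def remote_tiles(dim: int, num_cores: int, layout: str) -> int:
    """Count tiles a core must fetch from another core, assuming core c
    consumes the c-th contiguous slice of a residual with `dim` elements.

    Assumes dim is a multiple of TILE_WIDTH * num_cores (hypothetical helper).
    """
    num_tiles = dim // TILE_WIDTH
    tiles_per_core = num_tiles // num_cores
    remote = 0
    for tile_idx in range(num_tiles):
        consumer = tile_idx // tiles_per_core
        if layout == "interleaved":
            owner = interleaved_owner(tile_idx, num_cores)
        else:
            owner = sharded_owner(tile_idx, tiles_per_core)
        remote += owner != consumer  # True counts as 1
    return remote


if __name__ == "__main__":
    for layout in ("interleaved", "sharded"):
        print(layout, remote_tiles(dim=4096, num_cores=32, layout=layout))
```

In this model the sharded layout never reads a tile from another core, while the interleaved layout does so for almost every tile. Real hardware behavior depends on many factors the model ignores, so treat it as intuition for the change rather than a performance estimate.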